DAND P5 - Quality Red Wine by Kangsan Kim

The dataset I’ve decided to explore in this project is the Red Wine dataset. It contains information on red wine characteristics such as acidity, sugar, pH, and alcohol percentage.

Univariate Plots Section

##       X fixed.acidity volatile.acidity citric.acid residual.sugar
## 46   46           4.6             0.52        0.15            2.1
## 96   96           4.7             0.60        0.17            2.3
## 132 132           5.6             0.50        0.09            2.3
## 133 133           5.6             0.50        0.09            2.3
## 143 143           5.2             0.34        0.00            1.8
## 145 145           5.2             0.34        0.00            1.8
##     chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 46      0.054                   8                   65  0.9934 3.90
## 96      0.058                  17                  106  0.9932 3.85
## 132     0.049                  17                   99  0.9937 3.63
## 133     0.049                  17                   99  0.9937 3.63
## 143     0.050                  27                   63  0.9916 3.68
## 145     0.050                  27                   63  0.9916 3.68
##     sulphates alcohol quality
## 46       0.56    13.1       4
## 96       0.60    12.9       6
## 132      0.63    13.0       5
## 133      0.63    13.0       5
## 143      0.79    14.0       6
## 145      0.79    14.0       6

This first table gives a quick view of some wines that have an alcohol content greater than 11%.
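A table like this can be produced by filtering on the alcohol column. The sketch below is only an assumption about how that might be done, not the original code: `wines` is a hypothetical data frame holding the six rows shown above, and the 13% threshold is chosen purely so that the filter visibly drops some rows.

```r
# Six rows copied from the table above (only the columns needed here).
wines <- data.frame(
  X       = c(46, 96, 132, 133, 143, 145),
  alcohol = c(13.1, 12.9, 13.0, 13.0, 14.0, 14.0),
  quality = c(4, 6, 5, 5, 6, 6)
)

# Keep only the wines above a chosen alcohol threshold (13% is illustrative).
strong <- subset(wines, alcohol > 13)
print(strong)
```

On the full dataset the same call would use the complete data frame and whatever threshold is of interest.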

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This summary shows some introductory information about each column and the range of values they carry.

## Warning: Removed 4 rows containing non-finite values (stat_bin).

This chart explores the free sulfur dioxide content across the dataset. Most wines have a free sulfur dioxide content below about 20: the median is 14 and the third quartile is 21.

## Warning: Removed 2 rows containing non-finite values (stat_bin).

This chart explores the total sulfur dioxide content across the dataset. More than half of the wines have a total sulfur dioxide content below 50 (the median is 38), with a long right tail reaching 289.

## Warning: Removed 1 rows containing missing values (geom_bar).

This plot shows the number of wines by pH level, and it appears most wines are between a pH of 3 and 3.5.

## Warning: Removed 71 rows containing non-finite values (stat_bin).

This plot shows the count of wines by density over the range 0.99 to 1, in bins of 0.001.

## Warning: Removed 21 rows containing non-finite values (stat_bin).

This chart explores volatile acidity in the dataset. It appears almost like a normal distribution!

This plot shows the number of wines for each level of fixed acidity. The most common values are between 7 and 8, which fits the summary table: the median is 7.9 and the middle half of the wines lies between 7.1 and 9.2.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

This shows the number of wines across the different levels of alcohol content. The distribution peaks between 9% and 10%, and the middle half of the wines falls between 9.5% and 11.1%.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram shows the number of wines in each quality rating. We can see that the majority of wines are rated 5 or 6. I wonder what it takes to receive a rating of 8?

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1430 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

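The plot above is restricted to a narrow chloride range: the warning shows that 1430 of the 1599 wines fall outside its axis limits, so it covers only the remaining 169 or so. Within that subset, most wines have a chloride level between 0.1 and 0.12. Across the full dataset, the summary table puts the middle half of the wines between 0.07 and 0.09.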
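The "Removed ... rows" warnings most likely come from axis limits that exclude part of the data. As a rough sketch, the number of wines a given range would drop can be checked before plotting. The chloride values below are the six from the first table in this report, and `limits` is a hypothetical range chosen only for illustration.

```r
# Chloride values of the six wines listed in the first table of this report.
chlorides <- c(0.054, 0.058, 0.049, 0.049, 0.050, 0.050)

# Hypothetical x-axis limits, chosen only for illustration.
limits <- c(0.05, 0.06)

# Values outside the limits are the ones a histogram with these limits drops.
outside <- chlorides < limits[1] | chlorides > limits[2]
n_dropped <- sum(outside)
share_dropped <- n_dropped / length(chlorides)

cat("dropped:", n_dropped, "of", length(chlorides), "\n")
```

Applied to the warning above, 1430 of 1599 rows is about 89% of the wines, which is worth knowing before reading anything off the plot.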

## Warning: Removed 8 rows containing non-finite values (stat_bin).

It appears that most of the wines have a sulphate level between 0.4 and 0.8.

## redWine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## redWine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## redWine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## redWine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## redWine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## redWine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

This is a summary of residual sugar within each quality rating. The mean is about the same across the ratings (between 2.48 and 2.72), so sugar does not appear to have a direct effect on the rating itself.

## redWine$quality: 3
## [1] 10532
## -------------------------------------------------------- 
## redWine$quality: 4
## [1] 42240
## -------------------------------------------------------- 
## redWine$quality: 5
## [1] 505290
## -------------------------------------------------------- 
## redWine$quality: 6
## [1] 540656
## -------------------------------------------------------- 
## redWine$quality: 7
## [1] 165601
## -------------------------------------------------------- 
## redWine$quality: 8
## [1] 14881

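This output was meant to be a count of the number of wines in each quality rating. However, the six values add up to 1,279,200, which is the sum of the row index X from 1 to 1599, so they appear to be sums of X within each rating rather than counts. The dataset holds only 1,599 wines in total.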
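The six values above add up to 1,279,200, which equals 1 + 2 + ... + 1599, so they look like per-rating sums of the index column X and not counts. The sketch below shows the difference on a hypothetical data frame `wines` built from the six rows in the first table; it is an illustration, not the original code.

```r
wines <- data.frame(
  X       = c(46, 96, 132, 133, 143, 145),
  quality = c(4, 6, 5, 5, 6, 6)
)

# Number of wines per quality rating.
counts <- table(wines$quality)

# Sum of the row index X per rating: much larger numbers, not a count.
index_sums <- tapply(wines$X, wines$quality, sum)

print(counts)
print(index_sums)

# Check on the reported values: with X running from 1 to 1599, the
# per-rating sums of X must add up to sum(1:1599).
reported <- c(10532, 42240, 505290, 540656, 165601, 14881)
total_index <- sum(1:1599)
print(sum(reported) == total_index)
```

Using `table()` (or `length` in place of `sum`) on the full dataset would give the actual number of wines in each rating.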

## redWine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## redWine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## redWine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## redWine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## redWine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## redWine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

This shows a numerical summary of the alcohol content, grouped by quality level.
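The layout of this output matches what base R's `by()` prints when it applies `summary` within each group. A minimal sketch, assuming that is roughly how it was produced; `wines` is a hypothetical data frame holding the six rows from the first table, not the full dataset.

```r
wines <- data.frame(
  alcohol = c(13.1, 12.9, 13.0, 13.0, 14.0, 14.0),
  quality = c(4, 6, 5, 5, 6, 6)
)

# Six-number summary of alcohol within each quality rating.
by_quality <- by(wines$alcohol, wines$quality, summary)
print(by_quality)

# The medians alone, as a named vector that is easier to compare.
medians <- tapply(wines$alcohol, wines$quality, median)
print(medians)
```

Pulling out just the medians makes the trend across ratings easier to read than six separate summaries.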

Univariate Analysis

What is the structure of your dataset?

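The dataset contains 1,599 red wines and 13 variables: a row index (X), 11 chemical measurements (such as acidity, residual sugar, density, pH, and alcohol), and a quality rating that ranges from 3 to 8. In this section I explored it with numerical summaries and bar charts that look at one variable at a time.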

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality rating, and how certain characteristics of the wine relate to it.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The numerical summaries help support this investigation. Alcohol looks the most promising: its median is about 9.9% for wines rated 3 and 12.15% for wines rated 8, while the mean residual sugar stays roughly flat across ratings.

Did you create any new variables from existing variables in the dataset?

For this dataset, I did not create any new variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

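Through my analysis, I did not notice any unusual distributions, although the summary table shows that a few variables (residual sugar, chlorides, and total sulfur dioxide) have long right tails, which is why some histograms drop rows at their axis limits. This was a clean and tidy dataset of wines, so I did not transform it.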

Bivariate Plots Section

This plot explores the alcohol content within each quality rating. Here we can see that the overall median alcohol content is just above 10%.

This plot shows the relationship between pH levels and the density of the wine. We can see that the density trends downward as pH levels rise.

I created two new datasets using dplyr’s group_by() function to group the data by quality rating and by alcohol content.

Here, we can see the average alcohol and pH content for each quality rating, along with the number of wines in each category.

## # A tibble: 6 x 7
##   alcohol  pH_mean sugar_mean density_mean sulphate_mean quality_mean
##     <dbl>    <dbl>      <dbl>        <dbl>         <dbl>        <dbl>
## 1    8.40 3.010000       1.95    1.0001000     0.7100000          4.5
## 2    8.50 3.150000       1.60    0.9991400     0.6500000          5.0
## 3    8.70 3.330000       2.10    0.9977500     0.8100000          6.0
## 4    8.80 3.160000      13.80    1.0024200     0.7500000          5.0
## 5    9.00 3.287667       3.06    0.9984173     0.6056667          5.4
## 6    9.05 3.390000       1.90    0.9958500     0.4300000          4.0
## # ... with 1 more variables: n <int>

The table above shows the averages of certain wine characteristics, grouped by alcohol content. We then analyze one variable, sugar, to see if there is a relationship between the average sugar content and alcohol content. As you can see, there doesn’t seem to be a direct relationship.

However, there does seem to be a relationship between the average density of the wine and its alcohol content.
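The report built these grouped averages with dplyr. For readers without dplyr, a similar table can be sketched in base R with `aggregate()`; this is a swapped-in technique, not the original code. `wines` is a hypothetical data frame with the six rows from the first table, and the column names mirror those in the tibble above.

```r
wines <- data.frame(
  alcohol = c(13.1, 12.9, 13.0, 13.0, 14.0, 14.0),
  pH      = c(3.90, 3.85, 3.63, 3.63, 3.68, 3.68),
  density = c(0.9934, 0.9932, 0.9937, 0.9937, 0.9916, 0.9916)
)

# Mean pH and density for each distinct alcohol value.
means <- aggregate(cbind(pH_mean = pH, density_mean = density) ~ alcohol,
                   data = wines, FUN = mean)

# Number of wines behind each mean, computed with the same grouping.
means$n <- aggregate(pH ~ alcohol, data = wines, FUN = length)$pH
print(means)
```

Keeping the group size `n` next to each mean matters here, because many alcohol values are shared by only one or two wines and their averages are noisy.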

## 
##  Pearson's product-moment correlation
## 
## data:  redWine$fixed.acidity and redWine$volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309

This was a test of the correlation between fixed acidity and volatile acidity using the Pearson method. The Pearson coefficient itself has no built-in cutoff; a common rule of thumb treats an absolute value above 0.3 as a meaningful (moderate) correlation. The result here is -0.256, which falls just short of that mark: the tiny p-value shows the correlation is statistically distinguishable from zero, but the relationship is weak.

## 
##  Pearson's product-moment correlation
## 
## data:  redWine$alcohol and redWine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

This test helps confirm the previous analysis: with a coefficient of 0.476, quality and alcohol content are moderately and positively correlated.
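To see how the numbers in the two test outputs fit together, the t statistic can be recomputed from the reported correlation and degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below uses only values printed above; the helper name `t_from_r` is made up for this illustration.

```r
# Values reported by the two cor.test() outputs above.
r_acidity <- -0.2561309   # fixed.acidity vs volatile.acidity
r_alcohol <-  0.4761663   # alcohol vs quality
df <- 1597                # n - 2, with n = 1599 wines

# t statistic for a Pearson correlation (hypothetical helper).
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t_acidity <- t_from_r(r_acidity, df)
t_alcohol <- t_from_r(r_alcohol, df)

# Two-sided p-value from the t distribution.
p_acidity <- 2 * pt(-abs(t_acidity), df)

# Share of variance one variable explains in the other (r squared).
r2_acidity <- r_acidity^2
r2_alcohol <- r_alcohol^2

print(round(c(t_acidity = t_acidity, t_alcohol = t_alcohol), 3))
print(round(c(r2_acidity = r2_acidity, r2_alcohol = r2_alcohol), 3))
```

With 1,599 wines even a weak correlation yields a very small p-value, so the size of r (or r squared, roughly 7% for the two acidities and 23% for alcohol and quality) is the better guide to how strong each relationship is.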

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I started off looking for a correlation between the quality of the wine and its characteristics. I initially thought that alcohol content might have a relationship with the quality rating. I also wanted to explore whether any other characteristics of wine were directly correlated with, or affected by, one another.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

One characteristic I thought was interesting was the overall increase in the alcohol content in relation to the quality rating. While I did imagine a higher alcohol content would result in people enjoying the wine more, I thought that it would have a limit, or not be as strongly correlated as it showed.

What was the strongest relationship you found?

The strongest relationship I found, both in the plots and through the cor.test() function, was between alcohol content and quality rating (r = 0.476). The correlation between fixed and volatile acidity was weaker (r = -0.256).

Multivariate Plots Section

This plot explores the alcohol content in wine compared to its pH level, split out by quality rating. Keeping in mind that each dot represents 5 wines with those characteristics, we can see where the majority of the wines are.

This plot shows that the amount of sulphates seems to stay in a consistent range between 0.5 and 1 as the alcohol content increases. It also shows lighter blues towards the higher alcohol content, which indicate higher quality.

Here, we explore the relationship between acidity and pH. We can see that there is a higher percentage of volatile acidity as the pH increases (which means the wine is less acidic). From this, we can infer that fixed acidity, rather than volatile acidity, is associated with lower pH levels.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I wanted to see if there is any relationship between the characteristics of wine and its quality rating, and initially proposed that there is no single direct relationship. Other than alcohol content, my additional analysis suggests that no single variable is a reliable predictor of wine quality.

Were there any interesting or surprising interactions between features?

It was interesting to see the relationship between volatile acidity and pH levels. The word “volatile” makes me think of movement and action, which are also what I associate with acidity. So it was interesting to see that more volatile acidity actually went along with a higher pH, meaning a less acidic wine.


Final Plots and Summary

Plot One

Description One

I chose this plot that compares alcohol content to its respective pH levels, separated by the wine’s quality rating. While we do not see any direct relationship, we can see that the majority of wines have an alcohol content of around 10%, with a pH level of between 3.0 and 3.5.

Plot Two

Description Two

I chose this plot because the first question I had was “does the alcohol content have anything to do with its rating?” In my experience, people can judge different wines with a heavy bias toward alcohol content. When I saw that the median was just above 10%, I understood that alcohol did not weigh on the ratings as heavily as I thought, even though most of the wine is rated a 5 or 6.

Plot Three

Description Three

I chose this plot because it combines everything I have learned so far to show the great detail of the amount of acidity in relation to the wine’s pH levels.


Reflection

One struggle in exploring this dataset was finding a meaningful relationship between the variables that I could attribute to the overall quality of the wine. What did go well was casting doubt on the idea that a single factor can cause the quality of the wine to go up or down. It was surprising to see elements of the wine that you would not expect to be linked show a dependent relationship. Moving forward, additional work can be done with datasets like this, such as including more characteristics of the wine. This would allow a deeper search into what makes a particular wine rate higher than another.